Project Description

KKBox is a leading music streaming service in Asia, offering millions of people access to its music library through advertising-supported and paid subscription plans. Paid users account for an important share of its profits.

The aim of this project is to develop predictive models that determine whether a user will churn after their subscription expires. We deployed three classification models: Logistic Regression, Decision Tree, and XGBoost. We also performed an Exploratory Data Analysis to understand the data distribution and the relationships between features.

In this report we use the datasets provided by Kaggle.

Datasets

Our data is defined for KKBox’s Churn Prediction Challenge on Kaggle. More information about the challenge and the data can be found on the competition page. The competition includes four main datasets, all of which we use in our term project.

Members Dataset

This dataset contains customers’ personal information.

## Rows: 6,769,473
## Columns: 6
## $ msno                   <chr> "Rb9UwLQTrxzBVwCB6+bCcSQWZ9JiNLC9dXtM1oEsZA8...
## $ city                   <int> 1, 1, 1, 1, 6, 4, 1, 5, 5, 13, 1, 22, 4, 4, ...
## $ bd                     <int> 0, 0, 0, 0, 32, 30, 0, 34, 19, 63, 0, 18, 34...
## $ gender                 <chr> "", "", "", "", "female", "male", "", "male"...
## $ registered_via         <int> 11, 7, 11, 11, 9, 9, 7, 9, 9, 9, 7, 9, 9, 9,...
## $ registration_init_time <int> 20110911, 20110914, 20110915, 20110915, 2011...

First 10,000 rows of the members dataset:

NaN values of the dataset:

##                        NaN Count
## msno                           0
## city                           0
## bd                             0
## gender                         0
## registered_via                 0
## registration_init_time         0
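The missing-value tables in this report can be reproduced with a column-wise count; a minimal sketch in base R, using a tiny illustrative stand-in rather than the real members data:

```r
# Count missing values per column, as in the NaN tables of this report
# (members_toy is an illustrative stand-in, not the real members data)
members_toy <- data.frame(
  msno   = c("a", "b", "c"),
  gender = c("", "", "male"),   # blanks are empty strings, not NA
  bd     = c(0, NA, 32)
)
colSums(is.na(members_toy))   # only bd reports a missing value
```

Note that the empty strings in the gender column are not NA, which is why the table above reports zero NaN values for gender despite the blank entries.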

Log Dataset

This dataset contains users’ music listening history (logs) between 01/03/2017 and 31/03/2017.

## Rows: 18,396,362
## Columns: 9
## $ msno       <chr> "u9E91QDTvHLq6NXjEaWv8u4QIqhrHk72kE+w31Gnhdg=", "nTeWW/e...
## $ date       <int> 20170331, 20170330, 20170331, 20170331, 20170331, 201703...
## $ num_25     <int> 8, 2, 52, 176, 2, 3, 9, 181, 3, 5, 7, 78, 7, 1, 5, 40, 4...
## $ num_50     <int> 4, 2, 3, 4, 1, 0, 1, 68, 0, 4, 1, 5, 1, 0, 3, 6, 2, 1, 1...
## $ num_75     <int> 0, 1, 5, 2, 0, 0, 0, 5, 1, 1, 2, 3, 0, 0, 2, 2, 0, 0, 1,...
## $ num_985    <int> 1, 0, 3, 2, 1, 0, 0, 3, 1, 1, 7, 2, 0, 0, 1, 2, 1, 2, 0,...
## $ num_100    <int> 21, 9, 84, 19, 112, 39, 18, 54, 181, 30, 69, 8, 45, 17, ...
## $ num_unq    <int> 18, 11, 110, 191, 93, 41, 26, 291, 150, 31, 74, 85, 21, ...
## $ total_secs <dbl> 6309.273, 2390.699, 23203.337, 7100.454, 28401.558, 9786...

First 10,000 rows of log dataset:

NaN values:

##            NaN Count
## msno               0
## date               0
## num_25             0
## num_50             0
## num_75             0
## num_985            0
## num_100            0
## num_unq            0
## total_secs         0

Transaction Dataset

This dataset contains customers’ transactions between 01/01/2015 and 31/03/2017.

## Rows: 1,431,009
## Columns: 9
## $ msno                   <chr> "++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134...
## $ payment_method_id      <int> 32, 41, 36, 36, 41, 41, 41, 41, 41, 41, 41, ...
## $ payment_plan_days      <int> 90, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ plan_list_price        <int> 298, 149, 180, 180, 99, 149, 99, 99, 99, 149...
## $ actual_amount_paid     <int> 298, 149, 180, 180, 99, 149, 99, 99, 99, 149...
## $ is_auto_renew          <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ transaction_date       <int> 20170131, 20150809, 20170303, 20170329, 2017...
## $ membership_expire_date <int> 20170504, 20190412, 20170422, 20170331, 2017...
## $ is_cancel              <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

First 10,000 rows of transactions dataset:

NaN values:

##                        NaN Count
## msno                           0
## payment_method_id              0
## payment_plan_days              0
## plan_list_price                0
## actual_amount_paid             0
## is_auto_renew                  0
## transaction_date               0
## membership_expire_date         0
## is_cancel                      0

Churn Dataset

This dataset contains the churn labels for March 2017.

## Rows: 970,960
## Columns: 2
## $ msno     <chr> "ugx0CjOMzazClkFzU2xasmDZaoIqOUAZPsH1q0teWCg=", "f/NmvEzHf...
## $ is_churn <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

First 10,000 rows of churn dataset:

NaN values:

##          NaN Count
## msno             0
## is_churn         0
## # A tibble: 2 x 2
##   is_churn  Count
##      <int>  <int>
## 1        0 883630
## 2        1  87330
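The counts in the tibble above imply a heavily imbalanced target; a quick sanity check of the churn rate, with the counts copied from the table:

```r
# Class counts taken from the is_churn summary above
n_non_churn <- 883630
n_churn     <- 87330

churn_rate <- n_churn / (n_churn + n_non_churn)
round(churn_rate, 4)
## [1] 0.0899
```

Only about 9% of labelled users churned, so plain accuracy is inflated by the majority class; this is why the evaluation sections below also report precision, recall, and balanced accuracy.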

Exploratory Data Analysis

General Exploratory Data Analysis

The distribution of users by churn status can be observed in the graph below. The number of non-churned members is much higher than the number of churned ones.

The distribution of members by gender is shown in the graph below. There is no big difference between the numbers of male and female members. However, there are 37,060 records in the data whose gender column is empty.

There are 19 different ways to register to the system. The graph below shows the distribution of users by the channel through which they subscribed. Among the registration methods, channels 3, 7, and 9 are preferred more frequently than the others.

The distribution of members by city is given below. The city with the most members in the system is city number 1.

The age density of the members can be seen below. The application is preferred more by the young population.

Exploratory Data Analysis of Members

The frequency of new memberships over time is shown in the graphic below. Although there were occasional spikes until the end of 2014, sign-ups have increased noticeably since then.

The distribution of users by the day of the week on which they registered is shown below. It is clearly seen that registrations increase over the weekend.

The distribution of users by the month in which they registered is given below. It is clearly seen that registrations increase in the autumn period.

Exploratory Data Analysis of Transactions

Below is the distribution of preferred payment methods. Payment method number 41 is preferred by the majority of users.

The distribution of users who do and do not use automatic membership renewal is as follows. The vast majority of users have auto-renewal enabled.

The distribution of users who cancelled their membership at their own request versus those who did not is as follows. The number of users who cancelled is very low compared to those who did not.

It can be seen in the chart below that a large number of users prefer the 30-day payment plan.

Exploratory Data Analysis of User Logs

In the density chart below, the average usage times of users are shown in hours. The median usage time is 1.27 hours.
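The hourly usage values are derived from the total_secs column of the log dataset; a small sketch of the conversion, using sample values taken from the glimpse output above:

```r
# Convert daily listening time from seconds to hours
total_secs  <- c(6309.273, 2390.699, 23203.337)  # sample values from the log dataset
total_hours <- total_secs / 3600
round(total_hours, 2)
## [1] 1.75 0.66 6.45
```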

The density chart below shows the average number of unique songs listened to by users. The median value of the number of unique songs that users listen to is 18.93.

The density chart below shows the distribution of the average number of songs listened to by users, broken down by the percentage of each song completed.

Exploratory Data Analysis by Churn with Members

In the chart below, the percentage of users churning is separated by gender. The values shown are almost identical, so gender gives little insight into whether a user will churn.

Churn percentages by registration channel are shown below. There are important differences between the values: users who registered via channel 4 are more likely to churn, while those who used channel 7 are loyal users.

In the density chart below, churned and non-churned users are drawn in different colors and their age densities are shown. As seen in the graph, young users are more likely to churn.

RFM Analysis

RFM analysis is one of the main techniques for understanding customer behaviour and segmenting customers by their Recency, Frequency, and Monetary scores. The analysis is easy to implement and useful for understanding customer loyalty and value. It helps to make data-driven decisions about marketing strategy, can improve customer lifetime value (CLV), and can help prevent churn. After the RFM analysis is done, we plan to feed it into our machine learning models in the Churn Prediction section.

We use the transaction dataset to calculate the RFM scores based on customers’ purchase history. Since there is no missing data in the transaction dataset, we can proceed directly to pre-processing and RFM table generation.

Pre-Processing

## Rows: 1,182,989
## Columns: 12
## $ msno                     <chr> "///2pAPKetZe8zPqAwkYBjAUr+4pS8Rc6bsO4eGAl...
## $ Monetary_Overall         <dbl> 99, 99, 1788, 1788, 99, 480, 149, 1788, 17...
## $ Frequency_Overall        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Recency                  <drtn> 22 days, 26 days, 21 days, 342 days, 12 d...
## $ Days_in_Firm             <drtn> 22 days, 26 days, 21 days, 342 days, 12 d...
## $ Membership_Expire_Date   <date> 2017-04-10, 2017-04-06, 2018-04-25, 2017-...
## $ Average_Auto_Renew       <dbl> 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0....
## $ Average_Cancelation      <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0....
## $ Median_Payment_Plan_Days <dbl> 30, 30, 410, 410, 30, 100, 30, 410, 410, 3...
## $ Months_in_Firm           <dbl> 0.73333333, 0.86666667, 0.70000000, 11.400...
## $ Monetary_Val             <dbl> 4.500000, 3.807692, 85.142857, 5.228070, 8...
## $ Frequency_Val            <dbl> 0.045454545, 0.038461538, 0.047619048, 0.0...

In the processing step, we generated two new columns, Monetary_Val and Frequency_Val, for calculating the RFM scores. These columns are normalized versions of the Monetary_Overall and Frequency_Overall columns: we rescaled them for each customer by the number of days since their first transaction (Days_in_Firm).
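The rescaling amounts to dividing the overall totals by each customer’s tenure in days; the sketch below reproduces the first values of the RFM table (assuming, as the displayed numbers suggest, a plain per-day division):

```r
# Per-day normalization: overall totals divided by tenure (Days_in_Firm);
# the inputs are the first rows of the RFM table above
rfm <- data.frame(
  Monetary_Overall  = c(99, 99, 1788),
  Frequency_Overall = c(1, 1, 1),
  Days_in_Firm      = c(22, 26, 21)
)
rfm$Monetary_Val  <- rfm$Monetary_Overall  / rfm$Days_in_Firm
rfm$Frequency_Val <- rfm$Frequency_Overall / rfm$Days_in_Firm

round(rfm$Monetary_Val, 6)   # 4.500000  3.807692  85.142857, matching the table
```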

The first 10,000 rows of the RFM table, with selected features, are displayed below:

RFM Scoring

To calculate the RFM scores, we divided each category (Recency, Frequency_Val, and Monetary_Val) into 5 groups and assigned each customer a score according to their rank in each column.
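A minimal sketch of this quintile scoring in base R. The ntile5 helper is illustrative (it mimics dplyr::ntile); note that recency is reversed so that recently active customers receive the highest score:

```r
# Split values into 5 roughly equal-sized groups (quintile ranks 1..5)
ntile5 <- function(x) ceiling(5 * rank(x, ties.method = "first") / length(x))

score_rfm <- function(recency, frequency, monetary) {
  r <- 6 - ntile5(recency)   # lowest recency (most recent) -> score 5
  f <- ntile5(frequency)     # highest frequency -> score 5
  m <- ntile5(monetary)      # highest monetary  -> score 5
  paste(r, f, m, sep = "-")
}

# Toy example: the most recent, most frequent, biggest spender gets 5-5-5
score_rfm(recency   = c(2, 300, 30, 90, 150),
          frequency = c(9, 1, 5, 3, 2),
          monetary  = c(900, 50, 400, 200, 100))
## [1] "5-5-5" "1-1-1" "4-4-4" "3-3-3" "2-2-2"
```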

Recency

Distribution of recencies in our dataset:

We divide the recency values into 5 categories, each containing approximately the same number of customers. Group 5 has the lowest recency values, while group 1 has the highest. A summary of the groups is given below:

Frequency

Distribution of frequency values in our dataset:

We divide the frequency values into 5 categories, each containing approximately the same number of customers. Group 1 has the lowest frequency values, while group 5 has the highest. A summary of the groups is given below:

Monetary

Distribution of monetary values given below:

We divide the monetary values into 5 categories, each containing approximately the same number of customers. Group 1 has the lowest monetary values, while group 5 has the highest. A summary of the groups is given below:

Customers’ RFM Scores

Finally, we can group our customers based on their RFM scores. We used median values when summarizing the groups because our score values have skewed distributions.

Key Takeaways of RFM

  • The 5-5-5 group has the highest share of customers, at 13.2%. This group contains our company’s most valuable customers. The 4-4-4 group has the second highest share (9.71%). We can say that the company’s customer loyalty is good.

  • However, the share of the lowest-scoring groups is also high: the cumulative percentage of groups that score 2 or lower in each of Recency, Frequency, and Monetary is 29.91%. These groups can be considered prone to churn. We can analyse them more deeply to understand what they did not like in our app, why they did not purchase again, and so on; looking at their behaviour is a simple way to identify our shortcomings.

  • The middle groups, with scores of 3 or 4, include the customers whose interaction with our app we want to increase. We may consider how to gain more benefit from these customers and how to increase their loyalty.

Churn Prediction

Data preprocessing includes merging and manipulating data from our four datasets and splitting the result into train and test sets. The datasets are merged on the unique key column, ‘msno’.
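A sketch of the merge-and-split step; only the ‘msno’ key comes from the datasets, while the toy data frames and the 70/30 ratio are illustrative assumptions:

```r
set.seed(42)

# Toy stand-ins for two of the four datasets, keyed by 'msno'
members <- data.frame(msno = c("a", "b", "c", "d"), city     = c(1, 1, 5, 13))
churn   <- data.frame(msno = c("a", "b", "c", "d"), is_churn = c(0, 1, 0, 0))

# Merge on the unique key shared by all datasets
full <- merge(members, churn, by = "msno")

# Random 70/30 train/test split (illustrative ratio)
idx   <- sample(nrow(full), size = floor(0.7 * nrow(full)))
train <- full[idx, ]
test  <- full[-idx, ]
```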

We check for multicollinearity before deploying the Logistic Regression and Decision Tree models. The XGBoost model handles multicollinearity well, whereas the other two do not.
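One common screen for multicollinearity is to flag pairs of numeric predictors whose absolute correlation exceeds a threshold; a sketch in base R (the 0.8 cutoff is an illustrative choice, not the project’s documented one):

```r
# Report predictor pairs with |correlation| above a threshold
high_cor_pairs <- function(df, threshold = 0.8) {
  cm <- cor(df)
  cm[lower.tri(cm, diag = TRUE)] <- NA        # keep each pair only once
  idx <- which(abs(cm) > threshold, arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             cor  = cm[idx])
}

# Toy example: x2 is a shifted copy of x1, so only that pair gets flagged
d <- data.frame(x1 = 1:10,
                x2 = (1:10) + 0.01,
                x3 = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))
high_cor_pairs(d)
```

A variance inflation factor check (e.g. car::vif on a fitted model) is a common alternative; predictors flagged by either screen would be dropped before fitting the Logistic Regression and Decision Tree models.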

Three machine learning models were developed in this project: Logistic Regression, Decision Tree Classifier, and XGBoost.


XGBoost

## [1]  train-error:0.027086 
## [2]  train-error:0.027143 
## [3]  train-error:0.025335 
## [4]  train-error:0.024846 
## [5]  train-error:0.024732 
## [6]  train-error:0.023859 
## [7]  train-error:0.022838 
## [8]  train-error:0.022968 
## [9]  train-error:0.022764 
## [10] train-error:0.022520
##                      Feature         Gain        Cover   Frequency
##  1:              days_to_exp 6.341254e-01 2.357843e-01 0.286738351
##  2:             Days_in_Firm 1.435447e-01 1.809238e-01 0.139784946
##  3:                  Recency 8.569580e-02 1.303593e-01 0.150537634
##  4:      Average_Cancelation 4.307361e-02 1.321873e-01 0.028673835
##  5:       Average_Auto_Renew 3.156932e-02 1.563641e-01 0.050179211
##  6:         Monetary_Overall 2.167204e-02 1.051637e-01 0.053763441
##  7:            Frequency_Val 1.631483e-02 2.012509e-02 0.050179211
##  8:             Monetary_Val 8.518838e-03 1.082921e-02 0.060931900
##  9: Median_Payment_Plan_Days 5.032259e-03 3.974759e-03 0.025089606
## 10:        Frequency_Overall 3.235741e-03 1.050770e-02 0.028673835
## 11:                  reg_via 2.559993e-03 2.466864e-03 0.039426523
## 12:             mean(num_75) 1.990011e-03 1.936731e-03 0.021505376
## 13:        mean(total_hours) 1.578315e-03 7.956564e-03 0.021505376
## 14:            mean(num_100) 6.669575e-04 9.570973e-04 0.007168459
## 15:            mean(num_unq) 1.888777e-04 3.512438e-04 0.021505376
## 16:             mean(num_25) 1.260189e-04 3.923106e-05 0.007168459
## 17:             mean(num_50) 7.919027e-05 6.096125e-05 0.003584229
## 18:            mean(num_985) 2.806379e-05 1.199098e-05 0.003584229

## [1]  train-error:0.027097 
## Will train until train_error hasn't improved in 10 rounds.
## 
## [2]  train-error:0.026183 
## [3]  train-error:0.026006 
## [4]  train-error:0.025992 
## [5]  train-error:0.025916 
## [6]  train-error:0.025806 
## [7]  train-error:0.025566 
## [8]  train-error:0.025562 
## [9]  train-error:0.025514 
## [10] train-error:0.025193 
## [11] train-error:0.025136 
## [12] train-error:0.025074 
## [13] train-error:0.024366 
## [14] train-error:0.024237 
## [15] train-error:0.024333 
## [16] train-error:0.024313 
## [17] train-error:0.024256 
## [18] train-error:0.024237 
## [19] train-error:0.024141 
## [20] train-error:0.024165 
## [21] train-error:0.024208 
## [22] train-error:0.023979 
## [23] train-error:0.023883 
## [24] train-error:0.023874 
## [25] train-error:0.023802
## [1] "test-error= 0.0244376004284949"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 81155  1824
##          1   366  6271
##                                           
##                Accuracy : 0.9756          
##                  95% CI : (0.9745, 0.9766)
##     No Information Rate : 0.9097          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8382          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.94485         
##                  Recall : 0.77468         
##                      F1 : 0.85134         
##              Prevalence : 0.09033         
##          Detection Rate : 0.06998         
##    Detection Prevalence : 0.07406         
##       Balanced Accuracy : 0.88509         
##                                           
##        'Positive' Class : 1               
## 

Logistic Regression

Logistic Regression Evaluation

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 81193  3799
##          1   361  4262
##                                           
##                Accuracy : 0.9536          
##                  95% CI : (0.9522, 0.9549)
##     No Information Rate : 0.91            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.649           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.92191         
##                  Recall : 0.52872         
##                      F1 : 0.67203         
##              Prevalence : 0.08995         
##          Detection Rate : 0.04756         
##    Detection Prevalence : 0.05159         
##       Balanced Accuracy : 0.76215         
##                                           
##        'Positive' Class : 1               
## 
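The precision, recall, and F1 values printed above follow directly from the confusion matrix cells; recomputing them from the Logistic Regression table (positive class = 1):

```r
# Cells of the Logistic Regression confusion matrix above (positive class = 1)
tp <- 4262   # predicted 1, reference 1
fp <- 361    # predicted 1, reference 0
fn <- 3799   # predicted 0, reference 1

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(precision, 5)  # 0.92191
round(recall, 5)     # 0.52872
round(f1, 5)         # 0.67203
```

These match the caret output, confirming that the model finds only about half of the churners despite its high overall accuracy.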

Decision Tree Classifier

Decision Tree Evaluation

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 81156  2099
##          1   398  5962
##                                          
##                Accuracy : 0.9721         
##                  95% CI : (0.971, 0.9732)
##     No Information Rate : 0.91           
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8119         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##               Precision : 0.93742        
##                  Recall : 0.73961        
##                      F1 : 0.82685        
##              Prevalence : 0.08995        
##          Detection Rate : 0.06653        
##    Detection Prevalence : 0.07097        
##       Balanced Accuracy : 0.86737        
##                                          
##        'Positive' Class : 1              
## 

Conclusion

All three models achieve high accuracy, and the XGBoost model performs best with 0.9756 accuracy.

According to the confusion matrix, the XGBoost model predicts customer churn with very good performance values. The F1, precision, and recall scores show that it is a good model for predicting churn. It correctly identified 6,271 of the 8,095 churned customers in the test data and 81,155 of the 81,521 non-churned customers.

Moreover, the model outperforms the Decision Tree and Logistic Regression models in terms of recall.

From a marketing perspective, this application emphasizes a high recall rate, since the company wants to identify who will not renew their membership. In that respect, we conclude that the XGBoost model is the wise choice if we want to predict which subscribers will churn in the next period.